fix(downloads): keep DownloadMonitorService alive when a poll cycle hits a transient SQLite lock by kevinheneveld · Pull Request #654 · Listenarrs/Listenarr

kevinheneveld · 2026-06-06T19:59:54Z

Summary

DownloadMonitorService's poll loop caught only OperationCanceledException, so any other exception thrown during a cycle escaped ExecuteAsync and tripped the .NET host's default BackgroundServiceExceptionBehavior.StopHost — shutting the whole app down. Under a container restart policy that manifests as a crash-loop. This makes the monitor loop log-and-continue on a failed cycle, matching every sibling background service.

Why this matters

A transient Microsoft.Data.Sqlite.SqliteException ("database is locked") is a normal, recoverable event under concurrent writes — it's raised while persisting a download (e.g. EfDownloadRepository.UpdateAsync/GetByIdAsync). With the old loop, a single such exception didn't just fail one cycle; it took the entire host down. On an instance running a container restart policy this became a crash-loop, restarting roughly every poll cycle (observed ~18 restarts in 9 hours), and each restart re-ran startup work — so a recoverable lock turned into sustained downtime and wasted CPU.

DownloadMonitorService was the lone background service with this gap: QueueMonitorService, MovedDownloadProcessor, and AutomaticSearchService already log-and-continue.

Implementation notes

Extracted the poll loop into internal async Task RunMonitorLoopAsync(CancellationToken) so the resilience behavior is isolated and unit-testable (the public ExecuteAsync keeps its 5s polling interval, which a test can't wait on).
The loop now distinguishes three cases per cycle:
- OperationCanceledException with cancellation requested → break (graceful shutdown, unchanged).
- OperationCanceledException without cancellation → a transient timeout within a cycle; logged and the loop continues.
- Any other exception → logged ("Error in Download Monitor Service") and the loop continues. OutOfMemoryException/StackOverflowException are deliberately not swallowed.
Surface area: one application-layer service file (DownloadMonitorService.cs) plus a new test. No API, schema, or behavioral change to what the monitor does on a successful cycle.
Backwards compatible: identical behavior on the happy path and on cancellation; the only change is that a failed cycle no longer terminates the host.

What this PR does NOT include

The underlying SQLite write contention. The connection string sets no busy_timeout, so concurrent writes can still raise transient locks — this PR stops a single transient lock from being fatal, but reducing how often it occurs is a separate hardening change.

Test plan

cd tests && dotnet test — 683 / 683 passing (1 new regression test: RunMonitorLoopAsync_CycleThrows_KeepsLoopingInsteadOfStoppingHost drives the extracted loop, forces a cycle to throw, and asserts the loop task is still running rather than completed).
No frontend changes (backend-only).

…sient SQLite lock DownloadMonitorService's poll loop caught only OperationCanceledException, so any other exception from a cycle propagated out of ExecuteAsync. A BackgroundService that throws trips the .NET host's default BackgroundServiceExceptionBehavior.StopHost, so the host shut down gracefully (exit 0) and the container's unless-stopped policy restarted it. In practice the cycle threw a Microsoft.Data.Sqlite "database is locked" SqliteException while persisting a download (EfDownloadRepository.UpdateAsync/ GetByIdAsync) under write contention, restarting the whole app every poll cycle (~18 restarts in 9h live). Each restart re-ran startup scans and reset the stall-reaper's in-memory 60-minute timers so it never reaped, keeping CPU pegged. The sibling background services (QueueMonitorService, MovedDownloadProcessor, AutomaticSearchService) already log-and-continue on a failed cycle. The poll loop is extracted to RunMonitorLoopAsync (for isolation + a fast, parallel-safe unit test) and now swallows-and-continues on any non-cancellation, non-fatal exception instead of letting it stop the host. The underlying lock contention (no busy_timeout on the SQLite connection) is a separate hardening follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

kevinheneveld marked this pull request as ready for review June 6, 2026 20:45

kevinheneveld requested a review from a team June 6, 2026 20:45

therobbiedavis assigned T4g1 Jun 8, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(downloads): keep DownloadMonitorService alive when a poll cycle hits a transient SQLite lock#654

fix(downloads): keep DownloadMonitorService alive when a poll cycle hits a transient SQLite lock#654
kevinheneveld wants to merge 1 commit into
Listenarrs:canaryfrom
kevinheneveld:fix/download-monitor-resilience

kevinheneveld commented Jun 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

kevinheneveld commented Jun 6, 2026

Summary

Why this matters

Implementation notes

What this PR does NOT include

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants